Abstract

Background: Comparing the metabolic pathways of different species is useful for understanding metabolic
functions and can help in studying diseases and engineering drugs. Several comparison techniques for metabolic 
pathways have been introduced in the literature as a first attempt in this direction. The approaches are based on some 
simplified representation of metabolic pathways and on a related definition of a similarity score (or distance measure) 
between two pathways. More recent comparative research focuses on alignment techniques that can identify similar 
parts between pathways. 

Results: We propose a methodology for the pairwise comparison and alignment of metabolic pathways that aims at 
providing the largest conserved substructure of the pathways under consideration. The proposed methodology has 
been implemented in a tool called MP-Align, which has been used to perform several validation tests. The results 
showed that our similarity score makes it possible to discriminate between different domains and to reconstruct a 
meaningful phylogeny from metabolic data. The results further demonstrate that our alignment algorithm correctly 
identifies subpathways sharing a common biological function. 

Conclusion: The results of the validation tests performed with MP-Align are encouraging. A comparison with 
another proposal in the literature showed that our alignment algorithm is particularly well-suited to finding the 
largest conserved subpathway of the pathways under examination. 
Background

Metabolism is the chemical system that generates the 
essential components for life. All living (micro) organisms 
possess an intricate network of metabolic routes for the 
biosynthesis of amino acids, nucleic acids, lipids and car- 
bohydrates and for the catabolism of different compounds 
driving cellular processes. Subsystems of metabolism 
dealing with specific functions are called metabolic path- 
ways. Over the last ten years these pathways have been the 
subject of a great deal of research, conducted primarily 
through two kinds of studies: one focusing on the analysis 
of single pathways, the other on the comparative analysis 
of a set of pathways. 

The studies that compare metabolic pathways of dif- 
ferent species can provide interesting information on 
their evolution and may help in understanding metabolic 
functions, which are important in studying diseases 
and identifying pharmacological targets. In the literature
many techniques have been proposed for comparing
metabolic pathways of different organisms. Each approach 
chooses a representation of metabolic pathways that mod- 
els the information of interest, proposes a similarity or 
a distance measure and possibly supplies a tool for performing
the comparison. The automation of the whole
process is enabled by the knowledge stored in metabolic 
databases such as KEGG [1], BioModels [2] or Meta- 
Cyc [3]. 

More recent comparative research has proceeded by 
focusing on alignment techniques that can identify sim- 
ilar parts between pathways, providing further insight 
for drug target identification [4,5], meaningful recon- 
struction of phylogenetic trees [6,7], and identification 
of enzyme clusters and missing enzymes [8,9]. Here too
approaches in the literature vary: some consider multiple 
pathways and identify their frequent or conserved sub- 
graphs [10,11]; others also build their alignments [12-21]. 

We propose a methodology for the pairwise comparison 
and alignment of metabolic pathways that aims at pro- 
viding the largest conserved substructure of the pathways 






© 2014 Alberich et al.; licensee BioMed Central Ltd. This is an Open Access article distributed under the terms of the Creative 
Commons Attribution License (http://creativecommons.org/licenses/by/2.0), which permits unrestricted use, distribution, and
reproduction in any medium, provided the original work is properly credited. 



Alberich et al. BMC Systems Biology 2014, 8:58
http://www.biomedcentral.com/1752-0509/8/58






under consideration. The methodology relies on a hyper- 
graph representation of metabolic pathways and defines 
a reaction similarity score that takes into account the 
chemical similarity and homology between pairs of reac- 
tions. The alignment technique uses the reaction similar- 
ity score and the pathway topology to identify the largest 
conserved subpathway between the two given pathways. 
The proposed methodology has been implemented in a 
tool called MP-Align, which has been used to perform
several validation tests reported herein. 

Methods 

This section describes the methodology proposed for the 
pairwise comparison and alignment of metabolic path- 
ways. We represent metabolic pathways as directed hyper- 
graphs and define a reaction similarity score based on 
both compound and enzyme similarities. On the basis of 
these choices we define the alignment algorithm, which 
has been implemented in MP-Align.

Hypergraph representation of a metabolic pathway 

A directed hypergraph is a mathematical structure H =
(V, E) where V is a finite set of nodes and E is a set of
directed hyperedges. A directed hyperedge is an ordered
pair of subsets of nodes E = (X, Y); X is the set of input
nodes of E while Y is its set of output nodes.

Metabolic pathways can be easily represented as 
directed hypergraphs: metabolites, enzymes and com- 
pounds can be modeled as nodes and reactions as hyper- 
edges. Despite the simplicity of this representation, we 
made the modeling choices described below. 

— We do not represent ubiquitous substances, such as
H2O, phosphate, ADP and ATP, as hypergraph nodes.
The same is true for enzymes, which are represented 
as reaction attributes and used to compute the 
reaction similarity. 

— Most of the reactions in metabolic pathways are 
reversible. A reversible reaction can occur in two 
directions, from the reactants to the products 
(forward reaction) or vice versa (backward reaction). 
The direction depends on the kind of reaction, on the 
concentration of the metabolites, and on conditions 
such as temperature and pressure. We model 
reversible reactions by two corresponding 
hyperedges, one for the forward reaction and the 
other for the backward reaction. 

— In a metabolic pathway one can distinguish between 
internal and external metabolites. The former are 
entirely produced and consumed in the network; the 
latter represent sources or sinks, that is, connection 
points produced or consumed by other pathways. We 
represent external metabolites as input-only (source)
or output-only (sink) nodes.
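
The three modeling choices above can be illustrated with a minimal Python sketch. It is not the MP-Align implementation: the class name `Reaction`, the helper names, the toy compounds A to D and the placeholder EC numbers are all hypothetical, and the set of ubiquitous substances contains only the four examples named in the text.

```python
from dataclasses import dataclass

# Ubiquitous substances are dropped from the node set (examples from the text).
UBIQUITOUS = {"H2O", "phosphate", "ADP", "ATP"}


@dataclass(frozen=True)
class Reaction:
    """A directed hyperedge: a set of input nodes and a set of output nodes."""
    name: str
    inputs: frozenset
    outputs: frozenset
    ec: str  # the enzyme is kept as an attribute, not as a node


def make_reactions(name, substrates, products, ec, reversible=False):
    """Return one hyperedge, or a forward/'rev' pair for a reversible reaction."""
    ins = frozenset(substrates) - UBIQUITOUS
    outs = frozenset(products) - UBIQUITOUS
    edges = [Reaction(name, ins, outs, ec)]
    if reversible:
        edges.append(Reaction(name + "rev", outs, ins, ec))
    return edges


def sources_and_sinks(reactions):
    """External metabolites: input-only (source) and output-only (sink) nodes."""
    consumed = set().union(*(r.inputs for r in reactions))
    produced = set().union(*(r.outputs for r in reactions))
    return consumed - produced, produced - consumed


# Toy pathway with hypothetical compounds and EC numbers: A -> B <-> C -> D
edges = (make_reactions("R1", ["A", "ATP"], ["B", "ADP"], "1.1.1.1")
         + make_reactions("R2", ["B"], ["C"], "2.2.2.2", reversible=True)
         + make_reactions("R3", ["C"], ["D"], "3.3.3.3"))
sources, sinks = sources_and_sinks(edges)
```

In this toy pathway the reversible reaction R2 yields the two hyperedges `R2` and `R2rev`, and the only external metabolites are A (source) and D (sink).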



Figures 1 and 2 show a metabolic pathway and its corresponding
hypergraph representation. More specifically,
Figure 1 shows part of the KEGG Arginine and proline
metabolism pathway for H. sapiens, focusing on the
compounds and enzymes directly involved in the Urea
Cycle; Figure 2 depicts the hypergraph representation of
the cycle itself. Purple nodes in the picture represent
compounds and grey nodes are hyperedges representing
reactions. Each hyperedge reports both the reaction
name (in KEGG nomenclature) and the EC number [22] of
the catalyzing enzyme. For each hyperedge, the incoming
arrows represent the input compounds of the corresponding
reaction and the outgoing arrows represent the output
compounds. Note that the reversible reaction R00557 is
translated into two corresponding hyperedges, one for the
forward reaction and the other for the backward reaction,
which can be distinguished by the suffix 'rev'.

Reaction similarity score 

In the literature there are several approaches to defining 
a reaction similarity score. Some represent each reaction 
through the enzyme that catalyzes it and define a score 
based on enzyme similarity, e.g. [7,19,23]. Other more 
recent proposals, e.g. [17,18], consider both compound 
and enzyme similarities. We employ the reaction similarity
score defined in [18]. More precisely, let $R_i = (I_i, E_i, O_i)$
denote a hyperedge representing a reaction, where $I_i$ is
the set of its input nodes (substrates), $E_i$ the enzyme that
catalyzes the reaction and $O_i$ the set of its output nodes
(products). The similarity score for every pair of reactions
$R_i = (I_i, E_i, O_i)$ and $R_j = (I_j, E_j, O_j)$ is given by the
following formula [18]:

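\[
\begin{aligned}
\mathit{SimReact}(R_i, R_j) ={} & \mathit{SimEnz}(E_i, E_j) \cdot W_e \\
 & + \mathit{SimComp}(I_i, I_j) \cdot W_i \\
 & + \mathit{SimComp}(O_i, O_j) \cdot W_o
\end{aligned}
\tag{1}
\]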

where $\mathit{SimEnz}(E_i, E_j)$ is the enzyme similarity between
$E_i$ and $E_j$, $\mathit{SimComp}(I_i, I_j)$ is the compound similarity
between the input node sets $I_i$, $I_j$, and $\mathit{SimComp}(O_i, O_j)$
is the compound similarity between the output node sets
$O_i$, $O_j$. The parameters $W_e$, $W_i$ and $W_o$ are fixed to $W_e = 0.4$
and $W_i = W_o = 0.3$ since, as stated in [18], they provide a
good balance between enzymes and compounds.

For the enzyme and compound similarities in (1) we 
made the following choices. 

— For enzymes, we use the EC hierarchical similarity 
measure that is based on the comparison of the 
unique EC number (Enzyme Commission number) 
associated to each enzyme, which represents its 
catalytic activity. The EC number is a 4-level
hierarchical scheme, d1.d2.d3.d4, developed by the
International Union of Biochemistry and Molecular
Biology (IUBMB) [22]. Enzymes with similar EC
Figure 1 Part of the KEGG pathway Arginine and proline metabolism for H. sapiens. This figure shows the compounds and enzymes directly
involved in the Urea Cycle.



classifications are functional homologues but do not 
necessarily have similar amino acid sequences. 
Given two enzymes $e = d_1.d_2.d_3.d_4$ and
$e' = d'_1.d'_2.d'_3.d'_4$, their similarity $S(e, e')$ depends on
the length of the common prefix of their EC numbers:

\[
S(e, e') = \max\{\, i = 1, 2, 3, 4 : d_j = d'_j,\ j = 1, \dots, i \,\} / 4
\]

For instance, the similarity between arginase
($e = 3.5.3.1$) and creatinase ($e' = 3.5.3.3$) is 0.75.

— For compounds, we use a similarity based on the
similarity measure computed by the SIMCOMP
(SIMilar COMPound) [24] tool. Given two
compounds, the tool represents their chemical
structure as graphs and outputs a measure of their
maximal common substructure.



Since a reaction may have more than one input
(output) compound, we need a way to combine the
similarity between pairs of compounds computed by
SIMCOMP. Given two sets $X$ and $Y$ of compounds,
the score $\mathit{SimComp}(X, Y)$ is computed by:

- defining a complete bipartite graph in which
the compounds in $X$ and $Y$ are nodes and the
weight of each edge $(x, y) \in X \times Y$ is the
similarity value of $x$ and $y$ computed by
SIMCOMP;

- applying the maximum weighted bipartite
matching algorithm to the resulting graph to
obtain the best match between $X$ and $Y$;

- summing the scores of the best match and
dividing it by $\max\{|X|, |Y|\}$.
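
The reaction similarity score can be sketched in a few lines of Python. This is an illustrative sketch, not the MP-Align code: the function names are ours, `toy_sim` is a hypothetical stand-in for the SIMCOMP compound score (assumed symmetric), the bipartite matching is solved by brute force (adequate only for the handful of compounds in a reaction), and returning 0 for an empty compound set is an assumption the text does not cover.

```python
from itertools import permutations

W_E, W_I, W_O = 0.4, 0.3, 0.3  # weights W_e, W_i, W_o fixed in the text


def sim_enz(ec1, ec2):
    """EC hierarchical similarity: common-prefix length of the EC numbers, over 4."""
    common = 0
    for a, b in zip(ec1.split("."), ec2.split(".")):
        if a != b:
            break
        common += 1
    return common / 4


def sim_comp(xs, ys, pair_sim):
    """Best one-to-one match between two compound sets, divided by max(|X|, |Y|)."""
    xs, ys = list(xs), list(ys)
    if not xs or not ys:
        return 0.0  # assumption: an empty set matches nothing
    small, large = (xs, ys) if len(xs) <= len(ys) else (ys, xs)
    # Brute-force maximum weighted bipartite matching (pair_sim assumed symmetric).
    best = max(
        sum(pair_sim(a, b) for a, b in zip(small, perm))
        for perm in permutations(large, len(small))
    )
    return best / len(large)


def sim_react(r1, r2, pair_sim):
    """Equation (1); each reaction is a triple (inputs, EC number, outputs)."""
    (i1, e1, o1), (i2, e2, o2) = r1, r2
    return (sim_enz(e1, e2) * W_E
            + sim_comp(i1, i2, pair_sim) * W_I
            + sim_comp(o1, o2, pair_sim) * W_O)


def toy_sim(a, b):
    """Hypothetical compound similarity: 1 for identical names, 0 otherwise."""
    return 1.0 if a == b else 0.0


enzyme_example = sim_enz("3.5.3.1", "3.5.3.3")  # arginase vs creatinase
r1 = ({"A", "B"}, "3.5.3.1", {"C"})
r2 = ({"A"}, "3.5.3.3", {"C"})
reaction_example = sim_react(r1, r2, toy_sim)
```

Here `enzyme_example` reproduces the value 0.75 quoted in the text for arginase and creatinase. For larger compound sets the brute-force step would normally be replaced by a polynomial-time assignment algorithm.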
Figure 2 Hypergraph representation of the Urea Cycle shown in Figure 1. Purple nodes represent compounds and grey nodes are hyperedges
representing reactions. They specify the catalyzing enzyme as an attribute. For each reaction, the incoming arrows represent the input compounds 
and the outgoing arrows represent the output compounds. Note that a reversible reaction (e.g. reaction R00557) is represented by a forward 
reaction (grey node with label R00557) and a backward one (grey node with label R00557rev). 
The MP-Align alignment algorithm

This section illustrates the MP-Align alignment algorithm.
The algorithm receives as input two directed
hypergraphs $H_1 = (V_1, E_1)$ and $H_2 = (V_2, E_2)$ representing
two metabolic pathways and gives their similarity
score and alignment as output. MP-Align has been implemented
in Python. The tool is freely available at
http://bioinfo.uib.es/~recerca/MPAlign.
The main steps of MP-Align follow.

Reaction path computation 

The first step of the alignment algorithm represents $H_1$
and $H_2$ as suitable paths of reactions called reaction paths.
Given a directed hypergraph $H$ representing a metabolic
pathway, a reaction path is a sequence of reactions (hyperedges)
$p = R_1, R_2, \dots, R_k$ such that:

— $R_1$ is a reaction having a source node (i.e. an
input-only node);

— for each $i, j \in [1, k]$ with $i \neq j$, $R_i$ and $R_j$ are different
reactions;

— for each $i \in [1, k - 1]$, some of the output nodes of $R_i$
are input nodes of $R_{i+1}$;

— the length $k$ of the path $p$ is maximal.

We denote by $\mathcal{R}_H$ the set of all the reaction paths in the
hypergraph $H$. It is obtained through a depth-first search
algorithm iterating over the source nodes of $H$.

This step results in the sets $\mathcal{R}_{H_1}$ and $\mathcal{R}_{H_2}$, which are the
reaction paths of $H_1$ and $H_2$, respectively.
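
A minimal sketch of this step is given below. It assumes that "maximal" means a path that cannot be extended by any further reaction, which is our reading of the definition; the function name and the toy reactions are hypothetical. The enumeration is exponential in the worst case, in line with the complexity remarks made later in the text.

```python
def reaction_paths(reactions):
    """Enumerate reaction paths by depth-first search.

    reactions: dict mapping a reaction name to (set of inputs, set of outputs).
    """
    produced = set().union(*(outs for _, outs in reactions.values()))
    # Start from reactions that consume at least one source (input-only) node.
    starts = [r for r, (ins, _) in reactions.items() if ins - produced]
    paths = []

    def extend(path):
        last_outputs = reactions[path[-1]][1]
        successors = [r for r, (ins, _) in reactions.items()
                      if r not in path and ins & last_outputs]
        if not successors:
            paths.append(tuple(path))  # cannot grow further: keep it
        for r in successors:
            extend(path + [r])

    for start in starts:
        extend([start])
    return paths


# Toy hypergraph with hypothetical names: A -> B, then B -> C -> E or B -> D
toy = {
    "R1": ({"A"}, {"B"}),
    "R2": ({"B"}, {"C"}),
    "R3": ({"B"}, {"D"}),
    "R4": ({"C"}, {"E"}),
}
toy_paths = reaction_paths(toy)
```

For the toy hypergraph the search starts at `R1`, the only reaction with a source node, and returns the two paths `R1, R2, R4` and `R1, R3`.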

Reaction path alignment 

The second step establishes a first correspondence
between $H_1$ and $H_2$ in terms of their sets of reaction paths
$\mathcal{R}_{H_1}$ and $\mathcal{R}_{H_2}$. This is done by performing an all-against-all
alignment of the paths in $\mathcal{R}_{H_1}$ and $\mathcal{R}_{H_2}$. More precisely,
two reaction paths $p \in \mathcal{R}_{H_1}$ and $p' \in \mathcal{R}_{H_2}$ are aligned
using the classical Smith-Waterman sequence alignment
algorithm [25], where the similarity between a reaction
$R$ in the path $p$ and a reaction $R'$ in the path $p'$ is given
by $\mathit{SimReact}(R, R')$. The score of the obtained sequence
alignment is denoted by $\mathit{scorePath}(p, p')$.
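
The following sketch shows a Smith-Waterman local alignment over two reaction paths. The linear gap penalty of 0.5 is an assumption made for illustration (the text does not state the gap scheme used by MP-Align), and `same` is a hypothetical stand-in for SimReact.

```python
def smith_waterman(p, q, sim, gap=0.5):
    """Local alignment of two reaction paths.

    sim(R, R') plays the role of SimReact; gap is an assumed linear penalty.
    Returns the best score and the list of aligned reaction pairs.
    """
    n, m = len(p), len(q)
    H = [[0.0] * (m + 1) for _ in range(n + 1)]
    best, best_cell = 0.0, (0, 0)
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            H[i][j] = max(0.0,
                          H[i - 1][j - 1] + sim(p[i - 1], q[j - 1]),
                          H[i - 1][j] - gap,
                          H[i][j - 1] - gap)
            if H[i][j] > best:
                best, best_cell = H[i][j], (i, j)

    # Trace back from the best cell until the score drops to zero.
    pairs = []
    i, j = best_cell
    while i > 0 and j > 0 and H[i][j] > 0:
        if H[i][j] == H[i - 1][j - 1] + sim(p[i - 1], q[j - 1]):
            pairs.append((p[i - 1], q[j - 1]))
            i, j = i - 1, j - 1
        elif H[i][j] == H[i - 1][j] - gap:
            i -= 1
        else:
            j -= 1
    pairs.reverse()
    return best, pairs


def same(a, b):
    """Hypothetical reaction similarity: 1 for identical names, 0 otherwise."""
    return 1.0 if a == b else 0.0


score, aligned = smith_waterman(["R1", "R2", "R3"], ["R1", "R3"], same)
```

In the example the second path lacks `R2`, so the alignment pairs `R1` with `R1` and `R3` with `R3` and pays one gap, giving a score of 1.5 under the assumed penalty.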

Reaction path matching 

The third step refines the correspondence between $H_1$ and
$H_2$ by defining a matching $\sigma \subseteq \mathcal{R}_{H_1} \times \mathcal{R}_{H_2}$ that associates
a path in $\mathcal{R}_{H_1}$ with its 'most similar' path in $\mathcal{R}_{H_2}$. This
is done by defining a complete bipartite graph where the
nodes are the reaction paths in $\mathcal{R}_{H_1}$ and $\mathcal{R}_{H_2}$ and the edge
weight between two nodes (paths) $p$ and $p'$ is the score
$\mathit{scorePath}(p, p')$ of their sequence alignment obtained in
the previous step. The matching $\sigma$ is the result of the maximum
weighted bipartite matching algorithm applied to
the complete bipartite graph.



Recall that a matching M on a bipartite graph is a subset 
of edges such that no two edges in M share an endpoint. 
The cost of M is the sum of the cost of its edges. A match- 
ing is called a maximum weight matching if its cost is at 
least as great as the cost of any other matching. 
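
To make the definition concrete, the sketch below computes a maximum weight matching on a small complete bipartite graph by exhaustive search. This is only an illustration with hypothetical path names and scores: exhaustive search is factorial in the number of paths, and a real implementation would use a polynomial-time method such as the Hungarian algorithm.

```python
from itertools import permutations


def max_weight_matching(left, right, weight):
    """Exhaustive maximum weight matching on a complete bipartite graph.

    Returns the matched (left, right) pairs and their total weight.
    """
    left, right = list(left), list(right)
    swap = len(left) > len(right)
    small, large = (right, left) if swap else (left, right)
    best_total, best_pairs = float("-inf"), []
    for perm in permutations(large, len(small)):
        pairs = [(b, a) if swap else (a, b) for a, b in zip(small, perm)]
        total = sum(weight(l, r) for l, r in pairs)
        if total > best_total:
            best_total, best_pairs = total, pairs
    return best_pairs, best_total


# Hypothetical scorePath values between two paths of H1 and two paths of H2.
scores = {("p1", "q1"): 2.0, ("p1", "q2"): 1.5,
          ("p2", "q1"): 1.8, ("p2", "q2"): 0.2}
sigma, total = max_weight_matching(["p1", "p2"], ["q1", "q2"],
                                   lambda p, q: scores[(p, q)])
```

The example shows why a global matching is needed: pairing `p1` with its individually best partner `q1` would leave `p2` with `q2` for a total of 2.2, whereas the maximum weight matching pairs `p1` with `q2` and `p2` with `q1` for a total of 3.3.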

Consider, for example, the KEGG pathway Arginine
and proline metabolism for the organisms Homo
sapiens (hsa00330) and Methanocaldococcus jannaschii
(mja00330). Once they have been represented as hyper- 
graphs, the matching between their reaction paths and 
the corresponding score can be computed, as shown in 
Table 1. 

Reaction matching 

The fourth step translates the reaction path matching $\sigma$
into a well-defined matching between reactions in $H_1$ and
reactions in $H_2$. This is done by analyzing the alignments
of all pairs of reaction paths $(p, p') \in \sigma$ and by building
a corresponding match-frequency matrix $M$ whose
rows and columns represent the reactions (hyperedges) of
$H_1$ and $H_2$, respectively. Each entry $m_{ij}$ of the matrix $M$
counts the number of times that the reaction $R_i$ in $H_1$ is
aligned to the reaction $R_j$ in $H_2$ in all pairs of reaction
paths $(p, p') \in \sigma$.

Suppose, for example, that reaction $R_i$ appears in $k$ reaction
paths in $\mathcal{R}_{H_1}$ and that $R_i$ is aligned to $R_j$ $k'$ times (with
$k' \le k$) in the corresponding paths of $\mathcal{R}_{H_2}$ (through $\sigma$).
In this case, the match-frequency matrix records the value
$m_{ij} = k'$.

Once the matrix $M$ has been determined, the best match
between reactions is sought, taking care to associate each
reaction in $H_1$ with exactly one reaction in $H_2$. This is
done by employing, once again, the maximum weighted
bipartite matching algorithm: given the frequency matrix
$M$ as input, it produces a matching $\rho \subseteq E_1 \times E_2$ as output,
which provides the final reaction matching between
$H_1$ and $H_2$.
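
The construction of the match-frequency matrix can be sketched as a simple counting step. The data layout (a dictionary from matched path pairs to their aligned reaction pairs) and all names are hypothetical; the sketch stops at the matrix and leaves the final matching to a maximum weighted bipartite matching routine.

```python
from collections import Counter


def match_frequency(alignments):
    """Count how often each reaction pair is aligned over the matched paths.

    alignments: dict mapping a matched path pair (p, p') to the list of
    reaction pairs (R, R') aligned in that path alignment.
    """
    counts = Counter()
    for pairs in alignments.values():
        counts.update(pairs)
    return counts


def as_matrix(counts, reactions1, reactions2):
    """Dense matrix M with rows for H1 reactions and columns for H2 reactions."""
    return [[counts[(r, s)] for s in reactions2] for r in reactions1]


# Hypothetical alignments of two matched path pairs.
alignments = {
    ("p1", "q1"): [("R1", "S1"), ("R2", "S2")],
    ("p2", "q2"): [("R1", "S1"), ("R3", "S2")],
}
counts = match_frequency(alignments)
M = as_matrix(counts, ["R1", "R2", "R3"], ["S1", "S2"])
```

In this example `R1` is aligned to `S1` in both matched path pairs, so the corresponding entry of `M` is 2, while `R2` and `R3` compete for `S2` with one occurrence each.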

Final score and hypergraph alignment 

The fifth and last step of the algorithm determines the
similarity score and the alignment of the two given hypergraphs.
Intuitively, the similarity score considers all pairs
of their 'most similar' reactions (determined by $\rho$) and
sums the score of the 'most similar' paths they belong to
(determined by $\sigma$), thus taking into account the topology
of the two given pathways. Formally, the similarity score
of $H_1$ and $H_2$ is defined as follows:

\[
\mathit{Score}(H_1, H_2) = \frac{\sum_{(R, R') \in \rho} \mathit{maxscorePath}(R, R')}{\max\{|E_1|, |E_2|\}}
\]

where

\[
\mathit{maxscorePath}(R, R') = \max\{\, \mathit{scorePath}(p, p') \mid (p, p') \in \sigma,\ R \in p,\ R' \in p',\ (R, R') \in \rho \,\}.
\]
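
The score definition translates almost directly into code. In the sketch below the argument layout and all names are hypothetical, the path scores are made-up values, and a reaction pair that occurs in no matched path pair contributes 0, which is an assumption the text does not spell out.

```python
def final_score(rho, sigma_scores, paths1, paths2, n_edges1, n_edges2):
    """Similarity score of two hypergraphs.

    rho: list of matched reaction pairs (R, R').
    sigma_scores: dict (p, p') -> scorePath(p, p') for the path pairs in sigma.
    paths1, paths2: dict path name -> tuple of reactions on that path.
    n_edges1, n_edges2: number of hyperedges |E1| and |E2|.
    """
    total = 0.0
    for r, r2 in rho:
        candidates = [s for (p, q), s in sigma_scores.items()
                      if r in paths1[p] and r2 in paths2[q]]
        total += max(candidates, default=0.0)  # assumption: 0 if no path pair
    return total / max(n_edges1, n_edges2)


# Hypothetical data: two matched path pairs and three matched reactions.
paths1 = {"p1": ("R1", "R2"), "p2": ("R1", "R3")}
paths2 = {"q1": ("S1", "S2"), "q2": ("S1", "S3")}
sigma_scores = {("p1", "q1"): 0.9, ("p2", "q2"): 0.6}
rho = [("R1", "S1"), ("R2", "S2"), ("R3", "S3")]
score = final_score(rho, sigma_scores, paths1, paths2, 3, 3)
```

The pair (`R1`, `S1`) lies on both matched path pairs and takes the larger score 0.9; the other two pairs take 0.9 and 0.6, so the example evaluates to (0.9 + 0.9 + 0.6) / 3.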
Table 1 Reaction paths and alignment

Path alignment hsa00330-mja00330
The final alignment of $H_1$ and $H_2$ is defined in terms
of their largest conserved substructure (sub-hypergraph).
More precisely, the alignment of $H_1$ and $H_2$ is determined
by using the reaction matching $\rho$ to build a relational
graph $G$ as follows:

— the nodes of $G$ are the reactions in $H_1$;

— an edge $(R_i, R_j)$, with $R_i$, $R_j$ reactions in $H_1$, is
introduced in $G$ if and only if

  — some output nodes of $R_i$ are input nodes of $R_j$,
  i.e. they are connected hyperedges in $H_1$, and

  — some output nodes of $\rho(R_i)$ are input nodes
  of $\rho(R_j)$, i.e. their images through $\rho$ are also
  connected hyperedges in $H_2$.

Intuitively, the relational graph $G$ expresses the connections
between the reactions matched by $\rho$. The largest
connected subgraph in the relational graph $G$ corresponds
to the largest conserved substructure (subpathway)
between $H_1$ and $H_2$ through $\rho$ and defines the final
alignment of the two hypergraphs.
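
A possible sketch of this last step follows. It reads "largest connected subgraph" as the largest connected component of G with edge direction ignored, which is our interpretation; the function name and the toy hypergraphs are hypothetical.

```python
def largest_conserved_subpathway(h1, h2, rho):
    """Largest connected component of the relational graph G.

    h1, h2: dict reaction name -> (set of inputs, set of outputs).
    rho: dict mapping each matched H1 reaction to its H2 reaction.
    """
    def connected(h, a, b):
        return bool(h[a][1] & h[b][0])  # some output of a is an input of b

    adj = {r: set() for r in h1}  # the nodes of G are the reactions of H1
    for a in rho:
        for b in rho:
            if a != b and connected(h1, a, b) and connected(h2, rho[a], rho[b]):
                adj[a].add(b)
                adj[b].add(a)  # direction ignored when testing connectivity

    best, seen = set(), set()
    for start in adj:
        if start in seen:
            continue
        component, stack = set(), [start]
        while stack:
            node = stack.pop()
            if node not in component:
                component.add(node)
                stack.extend(adj[node] - component)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Toy hypergraphs with hypothetical names; R3 and S3 are isolated reactions.
h1 = {"R1": ({"A"}, {"B"}), "R2": ({"B"}, {"C"}), "R3": ({"X"}, {"Y"})}
h2 = {"S1": ({"A"}, {"B"}), "S2": ({"B"}, {"C"}), "S3": ({"P"}, {"Q"})}
rho = {"R1": "S1", "R2": "S2", "R3": "S3"}
motif = largest_conserved_subpathway(h1, h2, rho)
```

In the toy example `R1` and `R2` are connected in the first hypergraph and their images `S1` and `S2` are connected in the second, so the conserved subpathway is `{R1, R2}`, while `R3` stays isolated.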

Let us consider once again the hypergraphs corresponding
to the KEGG pathway Arginine and proline
metabolism for H. sapiens (hsa00330) and M. jannaschii
(mja00330). The final alignment obtained by MP-Align is
shown in Table 2. In this case, the largest conserved substructure
(subpathway) contains the common reactions
appearing in the Urea Cycle (highlighted in boldface).



Table 2 Final alignment hsa00330-mja00330

hsa-mja alignment    <->    Enzyme
R01398               <->    ec:2.1.3.3
R01954               <->    ec:6.3.4.5
R00566               <->    ec:4.1.1.19
R01920               <->    ec:2.5.1.16
R00178               <->    ec:4.1.1.50
R02869               <->    ec:2.5.1.16
R01157               <->    ec:3.5.3.11
R01086               <->    ec:4.3.2.1



Common reactions appearing in the Urea Cycle are highlighted in boldface. 






Complexity and execution time 

The complexity of the MP-Align algorithm is exponential
in the size of the two input hypergraphs. This is already
true in its first step, the Reaction path computation.
Nevertheless, in our experience, MP-Align performs well on the
hypergraphs representing metabolic pathways. To give an
idea of the efficiency of MP-Align, we report its execution
times for the phylogeny recovery test illustrated in
the next section. It is a complex test that compares all
the common pathways of eight selected organisms: there
are 40 common pathways and 1440 pairwise
comparisons and alignments to be performed; that is,
MP-Align is executed 1440 times. We used a server with 16
processors at 2500 MHz and 24 GB of RAM.
Since MP-Align is sequentially implemented, each pairwise
comparison was performed by one processor. For
this test, 30% of the pairwise comparisons and alignments
were executed in 0.6 seconds at most; 60% were executed
in 1.23 seconds at most; 90% were executed in 5.61 seconds
at most; and all of them were executed in 4570.88
seconds at most. More precisely, only four pairwise
comparisons and alignments took more than
one hour.

Results and discussion 

This section reports the tests performed with MP-Align 
to validate our similarity score and alignment algorithm. 
The statistical analysis was done using the R [26] basic 
package. 

The first group of experiments employed cluster anal- 
ysis methods to assess whether our similarity score and 
alignment algorithm could use metabolic information to 
provide organism classifications that are correct from the 
evolutionary point of view. The second group of experi- 
ments sought to validate the recognition and alignment 
of pairs of pathways that are known to contain function- 
ally similar subunits but have different reaction sets and 
topologies. 

Data analysis 

The first test of a similarity score between objects is typ- 
ically cluster analysis, in which biological data objects are 
partitioned into groups such that the objects in each group 
share common traits. 

First test on the Glycolysis pathway 

The first test considered the Glycolysis pathway of all 
organisms in the KEGG database, which currently con- 
tains 1758 organisms: 52 Animals, 118 Archaea, 1491 
Bacteria, 53 Fungi, 18 Plants and 51 Protists. We used 
MP-Align to compute the similarity score of all pairs of 
organisms and then converted the similarity score into the 
following distance measure: 

\[
d(H_1, H_2) = \sqrt{2\,(1 - \mathit{Score}(H_1, H_2))} \tag{2}
\]
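
As a quick check of how Equation (2) behaves, the worked values below assume that the score lies in the interval [0, 1]; this range is our assumption for illustration and is not stated explicitly in the text.

\[
\begin{aligned}
\mathit{Score} = 1 &\;\Rightarrow\; d = \sqrt{2\,(1 - 1)} = 0, \\
\mathit{Score} = 0.5 &\;\Rightarrow\; d = \sqrt{2\,(1 - 0.5)} = 1, \\
\mathit{Score} = 0 &\;\Rightarrow\; d = \sqrt{2\,(1 - 0)} = \sqrt{2} \approx 1.414.
\end{aligned}
\]

Under that assumption, identical pathways are at distance 0 and the distance grows monotonically to at most $\sqrt{2}$ as the score decreases.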



The results were visualized and analyzed using a classical 
multidimensional scaling (MDS) method. We represented 
the considered pathways as points in a two-dimensional 
space: the more distant the points in space, the less similar 
the corresponding pathways with respect to the consid- 
ered distance. The results are shown on the left side of 
Figure 3. Note that Bacteria appear throughout the whole
Glycolysis universe of the two-dimensional MDS. This could be
due to the fact that there are considerably more Bacteria
than other organisms, and a higher dimensional representation
is required to discriminate between them and the
other domains.

The test was repeated with all the previous domains 
except the Bacteria. Moreover, after noting that some of 
the KEGG Glycolysis pathways are identical for differ- 
ent organisms, we selected one representative from each 
group of organisms with an identical pathway. Table 3 
shows the groups of organisms with identical Glycolysis 
pathways. Note that the various groups are homogeneous
with respect to the classification into Bacteria, Archaea, Protists,
Fungi, Plants and Animals, except for one group comprising
Arthropods and Plants. We ended up with 160 different
Glycolysis pathways. The results of this test are shown on 
the right side of Figure 3. Note that Protists are scattered 
throughout the whole space, while Archaea are clearly 
separated from Animals, Plants and Fungi. 

Second test on the Glycolysis pathway 

This test combined hierarchical clustering and pathway 
alignment. The idea was first to compare a set of pathways 
using our similarity score and produce a hierarchical clus- 
tering, and then to use our alignment algorithm to look 
for the largest conserved motifs in each cluster. The lat- 
ter was done by computing the pairwise alignments of the 
pathways in each cluster (in a predetermined order) and 
by considering their common set of aligned reactions, that 
is, the intersection of their largest common motif. The 
overall goal was to explore whether the alignment tech- 
nique could help in validating, or detecting the flaws of, 
the clustering results. Consider, for instance, two organ- 
isms having an identical pathway that forms a connected 
hypergraph. Now suppose that a reaction is removed from 
one of the pathways disconnecting its hypergraph. In this 
case, the similarity score considers the two organisms very 
close together, while their largest common motif reveals 
their structural difference. In fact, the comparison of two 
given pathways is based on their underlying sets of reac- 
tions and ignores their structure. A subsequent alignment 
phase includes structural information as well. 

We focused on the Glycolysis pathway of Animals. In 
KEGG there are currently 53 distinct Animals having 25 
distinct Glycolysis pathways. Table 3 shows the groups of 
organisms with an identical Glycolysis pathway in each 
row. Here as well, we took just one representative from 
Figure 3 Two-dimensional projections of the Glycolysis pathways of all organisms in the KEGG database (left) and all organisms except
Bacteria (right). Red points correspond to Animals, green points to Archaea, yellow points to Fungi, pink points to Plants, black points to Protists
and blue points to Bacteria. Note that in the projection on the left, Bacteria appear in the whole Glycolysis universe. By removing Bacteria, we can
observe, on the right, that Protists are scattered throughout the whole space while Archaea are clearly separated from Animals, Plants and Fungi.



each group of Animals. We performed the hierarchical
clustering using Ward's method [27] as well as the single,
average and complete linkage methods to obtain a
hierarchical clustering of the 25 pathways. All the methods
form a distinct cluster of Vertebrates, but do
not allow for a fine-grained distinction within the
Invertebrates. We chose the dendrogram obtained by Ward's
method, because it better separates Vertebrates and
Invertebrates.

The dendrogram can be cut at different heights to 
obtain different partitions of the 25 pathways. We consid- 
ered the cuts producing a total number of clusters ranging 
from 3 to 20, resulting in 18 different partitions. This 
allowed us to observe how the clusters evolve by incre- 
menting their total number. For each partition, we looked 
for the conserved motifs in each cluster using the proce- 
dure described above, and we observed how the common 
motifs evolve as the number of clusters increases. In 
Figure 4 we show the most relevant partitions: we con- 
sider 3 clusters (top left dendrogram), 8 clusters (top 
right dendrogram), 12 clusters (bottom left dendrogram) 
and 19 clusters (bottom right dendrogram), respectively. 
Each leaf in the dendrograms represents a specific organ- 
ism or the representative of a group sharing an identical 
Glycolysis pathway. The label of each leaf reports the
classification of the organism, the number of represented
organisms (in parentheses), the organism name (in
KEGG nomenclature), the cluster number, and the size of 
the common motif in the cluster (in terms of the number 
of reactions). For singleton clusters, the latter is just the 



number of reactions in the largest connected component 
of the organism itself. 

One can note how the clusters evolve as their total
number increases, and how the common motifs for
each cluster become more and more significant. In
particular, Vertebrates are separated from all other Animals
from the very start, and their alignment confirms that
they form a very cohesive cluster. In fact, in the top left
and right dendrograms, the Vertebrates cluster has a common
motif composed of 31 reactions. In the bottom left
and right dendrograms this cluster is refined into two
different clusters, with a common motif of size 48 and
50, respectively. In the top left dendrogram none of the
other clusters share a common motif. This means that
there are structural differences among the pathways in
each cluster that could not be captured by the similarity
score. In the top right dendrogram only cluster number 2
lacks a common motif. This remains true for cluster number
3 (composed of the same organisms) in the bottom
left dendrogram. A closer look at the Glycolysis pathway
of these organisms reveals that the Aedes aegypti
(aag) Glycolysis pathway is disconnected, so it can hardly
share a common motif with the other organisms. When
considering the 19 final clusters in the bottom right
dendrogram, Aedes aegypti forms a singleton cluster, and the
other organisms are divided into two clusters, both having
quite significant conserved motifs. The structural difference
of the Aedes aegypti Glycolysis pathway, invisible to
the similarity score, could be revealed by the alignment
phase.
Table 3 Organisms sharing an identical Glycolysis pathway
in the KEGG database

Equal Glycolysis pathways                        Classification

sce, kla, vpo, zro, dha, pic,
pgu, lei, cal, ctp, cdu, clu,                    Fungi
bfu, nfi, aor, afv, pes, cpw, ure

dan, der, dpe, dse, dwi, dya,
dgr, dmo, dvi, aga, cqu, nvi,                    Animals/Arthropods
[illegible]                                      Plants

hsa, ptr, pon, mcc, mmu, rno, ami,               Animals/Vertebrates
bta, ecb, mdo, gga, acs, xtr, dre

sso, sis, sia, sim, sid, siy, sin, sii           Archaea
ath, aly, pop, rcu, vvi, zma, ppp                Plants
lth, ncr, pan, mgr, fgr, afm, act                Fungi
[illegible]                                      [illegible]
mja, mvu, mfs, nriae, nnvn                       Archaea
inth, nning, msi, mel, mew                       Archaea
ago, yli, lbc                                    Fungi
tml, cci, scm                                    Fungi
pfh, pbe, pkn                                    Protists
hia, htu, hxa                                    Archaea
pab, ton, tba                                    Archaea
dka, dmu, tag                                    Archaea
pel, pyr, pog                                    Archaea
hhi, hbo                                         Archaea
cfa, mgp                                         Animals/Vertebrates
cin, dpo                                         Animals/Ascidians
cbr, bmy                                         Animals/Nematodes
olu, ota                                         Plants
ppa, cgr                                         Fungi
smp, pte                                         Fungi
cne, cnb                                         Fungi
ehi, edi                                         Protists
pfd, pyo                                         Protists
tan, tpv                                         Protists
mif, mig                                         Archaea
mac, mba                                         Archaea
mbu, mmh                                         Archaea
mhu, mem                                         Archaea
mpl, fpl                                         Archaea
hsi, hmu                                         Archaea
tac, tvo                                         Archaea
pfu, tko                                         Archaea
pyn, pya                                         Archaea

Each row-box shows the organisms sharing the same pathway and the
corresponding classification.



Other organisms whose Glycolysis pathway is disconnected
are Tribolium castaneum (tca), Apis mellifera
(ame) and Trichinella spiralis (tsp). Notice in the
dendrograms that, as soon as these organisms are isolated (by
increasing the number of clusters), the conserved motifs
in the newly formed clusters can evolve.

For the sake of completeness we repeated the same
test without including the organisms with a disconnected
Glycolysis pathway. Figure 5 shows the hierarchical
clustering obtained by Ward's method and exhibits a partition
into 3 clusters. By comparing the resulting dendrogram
with the top left dendrogram in Figure 4 one can notice
that all clusters now share a quite significant motif, which
is to say, the absence of the outlier organisms allows them
to be more cohesive.

Recovering phylogenies

One of the questions that arises when comparing 
metabolic pathways is whether it is possible to recon- 
struct robust phylogenetic trees from non-genomic data 
such as metabolic pathways. In [7] the authors argue that 
this is indeed the case, by presenting a method to assess 
the structural similarity of metabolic pathways for sev- 
eral organisms. On the basis of their similarity measure, 
the authors were able to reconstruct phylogenies similar 
to the NCBI reference taxonomy [28]. One of their exper- 
iments considered all the common metabolic pathways 
(taken from KEGG) of the following eight organisms: A.
fulgidus (afu), C. perfringens (cpe), H. influenzae (hin), L.
innocua (lin), M. jannaschii (mja), M. musculus (mmu), N.
meningitidis (nme) and R. norvegicus (rno). They belong
to Bacteria (cpe, hin, lin, nme), Archaea (mja, afu) and
Animals (mmu, rno).

We repeated the same experiment using our similar- 
ity score. We performed the pairwise comparison of all 
organisms for each common pathway and combined the 
obtained scores as follows. 

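For any pair of organisms $O_1$ and $O_2$ with $k$ common pathways, we
used the average score

\[
\mathit{AverageScore}(O_1, O_2) = \frac{\sum_{i=1}^{k} \mathit{Score}(H_1^i, H_2^i)}{k}
\]

where $H_1^i$ and $H_2^i$ denote the hypergraphs of the $i$-th common
pathway in $O_1$ and $O_2$, and the following distance measure

\[
d(O_1, O_2) = \sqrt{2\,(1 - \mathit{AverageScore}(O_1, O_2))}.
\]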
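
The combination of per-pathway scores into an organism distance can be sketched as follows. The function names and the three score values are hypothetical and serve only to illustrate the two formulas above.

```python
import math


def average_score(scores):
    """Mean of the Score values over the common pathways of two organisms."""
    return sum(scores) / len(scores)


def organism_distance(scores):
    """Distance between two organisms derived from their average score."""
    return math.sqrt(2 * (1 - average_score(scores)))


# Hypothetical scores for three common pathways of two organisms.
example_scores = [0.9, 0.7, 0.8]
avg = average_score(example_scores)
dist = organism_distance(example_scores)
```

With these made-up values the average score is about 0.8 and the distance is the square root of 0.4. Computing `organism_distance` for every pair of organisms gives the distance matrix used as input to the hierarchical clustering.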

The average score is suitable in this case because it 
makes it possible to capture comprehensive information 
from the comparison among all common pathways of the 
given organisms. Once all the distances between 
organisms were obtained, we performed a hierarchical 
clustering using the single, average and complete linkage 
methods. The three methods produced exactly the same 
clustering, thereby confirming the robustness of the aver- 
age score employed. The result is reported on the right 

Figure 4 Dendrograms of the hierarchical clustering with partitions into 3 (top left), 8 (top right), 12 (bottom left) and 19 (bottom right) 
clusters for the Glycolysis pathways of all Animals in the KEGG database. 



Figure 5 Dendrogram of the hierarchical clustering with partition into 3 clusters for the Glycolysis pathways of all Animals in the KEGG 
database having a connected Glycolysis pathway. 

of Figure 6: our phylogenetic tree coincides with the one 
obtained in [7], and it is very close to the NCBI reference 
taxonomy of the same organisms, shown on the left 
of Figure 6. More precisely, if, for instance, we consider 
the Robinson-Foulds distance on phylogenetic trees [29], 
we find that the NCBI taxonomy tree shares four 
of its five clusters with the phylogenetic tree derived by 
using MP-Align. The only cluster that is not present in 
the phylogenetic tree is {cpe, lin}. 
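
The cluster comparison described above can be sketched in a few lines of Python. The five NCBI clusters listed here are an assumption inferred from the grouping of the organisms given in the text, and the candidate tree is purely hypothetical: it was chosen only so that it lacks {cpe, lin}, and it is not the tree produced by MP-Align.

```python
def cluster_set(clusters):
    """Normalise an iterable of clusters into a set of frozensets."""
    return {frozenset(c) for c in clusters}


def compare_clusters(reference, candidate):
    """Return shared clusters, clusters missing from the candidate, and the
    size of the symmetric difference (an unnormalised Robinson-Foulds count)."""
    ref, cand = cluster_set(reference), cluster_set(candidate)
    return ref & cand, ref - cand, len(ref ^ cand)


# Assumed non-trivial clusters of the NCBI reference tree (not listed in the text).
ncbi_clusters = [
    {"afu", "mja"},
    {"mmu", "rno"},
    {"cpe", "hin", "lin", "nme"},
    {"cpe", "lin"},
    {"hin", "nme"},
]

# Hypothetical reconstructed tree that does not contain {cpe, lin}.
candidate_clusters = [
    {"afu", "mja"},
    {"mmu", "rno"},
    {"cpe", "hin", "lin", "nme"},
    {"hin", "nme"},
    {"cpe", "hin", "nme"},
]

shared, missing, rf = compare_clusters(ncbi_clusters, candidate_clusters)
print(len(shared), [sorted(c) for c in missing], rf)
```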

We repeated the test considering just one pathway, the 
Glycolysis pathway, and also considering randomly chosen 
subsets of 10, 20 and 30 pathways. The resulting 
phylogenetic trees are shown in Figure 7: they do not recover 
exactly the phylogeny of the original test, but they all 
distinguish Bacteria, Archaea and Animals, and in this sense 
they confirm the validity of the adopted average score 
and the robustness of the obtained phylogeny. In fact, 
the phylogenetic tree resulting from the 30 randomly 
chosen pathways perfectly characterizes the Bacteria into two 
distinct clusters, {cpe, lin} and {hin, nme}, as in the NCBI 
taxonomy. 

Therefore, we can conclude that the similarity score 
provided by MP-Align can reconstruct robust phylogenies 
that are meaningful and very close to the NCBI reference 
taxonomy. 
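
The clustering step used in this section can be illustrated with a small agglomerative procedure. This is a sketch under stated assumptions, not the code used for the experiments: the four organisms are a subset of those considered, the distances between them are invented for illustration, and the helper names are hypothetical. It merges the two closest clusters repeatedly under the single, complete and average linkage criteria and checks whether the three criteria create the same clusters.

```python
from itertools import combinations


def linkage_distance(a, b, dist, method):
    """Distance between clusters a and b under the chosen linkage criterion."""
    values = [dist[frozenset((x, y))] for x in a for y in b]
    if method == "single":
        return min(values)
    if method == "complete":
        return max(values)
    if method == "average":
        return sum(values) / len(values)
    raise ValueError(f"unknown linkage method: {method}")


def agglomerate(labels, dist, method):
    """Return the set of clusters created by successive merges."""
    active = [frozenset([label]) for label in labels]
    created = set()
    while len(active) > 1:
        # Pick the pair of active clusters with the smallest linkage distance.
        a, b = min(
            combinations(active, 2),
            key=lambda pair: linkage_distance(pair[0], pair[1], dist, method),
        )
        merged = a | b
        active = [c for c in active if c not in (a, b)] + [merged]
        created.add(merged)
    return created


# Invented distances between four organisms (illustration only).
labels = ["mmu", "rno", "mja", "afu"]
dist = {
    frozenset(("mmu", "rno")): 0.10,
    frozenset(("mja", "afu")): 0.20,
    frozenset(("mmu", "mja")): 0.90,
    frozenset(("mmu", "afu")): 0.80,
    frozenset(("rno", "mja")): 0.85,
    frozenset(("rno", "afu")): 0.95,
}

results = {m: agglomerate(labels, dist, m) for m in ("single", "average", "complete")}
same_clustering = results["single"] == results["average"] == results["complete"]
print(same_clustering)
```

When the groups are well separated, as in this toy matrix, the three criteria agree; disagreement between them would indicate that the grouping depends on the linkage choice.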



Metabolic pathway alignment 

Several tests were performed to evaluate our alignment 
tool, some of which were taken from [30]. As explained 
in [30], an example in favor of the so-called patchwork 
evolution model is the Urea Cycle, which, in terrestrial 
animals, clearly evolved by adding a new enzyme, Arginase, to 
a set of four enzymes previously involved in the 
biosynthesis of Arginine [31]. Therefore, we considered the KEGG 
pathway Arginine and proline metabolism for Homo 
Sapiens (hsa), Anolis Carolinensis (acs) and M. Jannaschii 
(mja) and performed their alignment using MP-Align. 
Since M. Jannaschii belongs to the Archaea domain, the 
Arginase enzyme is not present in its pathway and urea 
is not synthesized. In contrast, the reptile A. Carolinensis and 
the mammal H. Sapiens share part of the Urea Cycle. 
We found that MP-Align recognizes 
the identical parts of the Urea Cycle when comparing H. 
Sapiens and A. Carolinensis and finds a mismatch when 
comparing H. Sapiens and M. Jannaschii. Table 1 shows 
the reaction path alignment obtained by MP-Align when 
considering the Arginine and proline metabolism for H. 
Sapiens and M. Jannaschii. Note that the highest score is 
about 0.836, which corresponds to the reaction path 
alignment starting at N-Acetyl-L-citrulline for M. Jannaschii 
and at N-Acetylornithine for H. Sapiens and both ending 



Figure 6 NCBI taxonomy of the eight organisms considered (left) and phylogenetic reconstruction obtained by MP-Align using the 
average score (right). 
Figure 7 Phylogenetic reconstruction obtained by MP-Align for the Glycolysis pathway (top left) and for randomly chosen subsets of the 
common pathways of the selected organisms: 10 pathways (top right), 20 pathways (bottom left) and 30 pathways (bottom right). 



at Spermine, where the β-Alanine metabolism or the 
Glutathione metabolism is reached. Thus, in its first step, 
MP-Align is able to recognize and align the longest path 
that both organisms share. Moreover, Table 4 and Table 5 
show the reaction matchings obtained by MP-Align when 
reconsidering the Arginine and proline metabolism for H. 
Sapiens, A. Carolinensis and M. Jannaschii. The reactions 
that appear in the Urea Cycle are shown in boldface. The 
reaction catalyzed by the Arginase enzyme, R00551 in 
KEGG nomenclature, only appears when considering H. 
Sapiens and A. Carolinensis (see Table 5). Instead, when 
considering H. Sapiens and M. Jannaschii, R00551 is not 
aligned (see Table 4): the reactions in boldface are in the 
upper part of the Urea Cycle but the cycle is incomplete. 
Table 6 shows the final alignment between H. Sapiens and 
A. Carolinensis: the enzymes catalyzing the reactions are 
listed for easy recognition in the KEGG pathway map. 
It is evident that all the reactions in the Urea Cycle (in 
boldface) are conserved, and the whole cycle is correctly 
aligned. Table 2 shows the final alignment between H. 
Sapiens and M. Jannaschii: note that reaction R00551 is 
not aligned and, consequently, the Urea Cycle is not a 
common conserved subpathway. 
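
The check discussed above, namely whether the Arginase reaction is aligned in a given reaction matching, can be sketched as follows. The two lists are short excerpts of the pairs reported in Table 5 and Table 4, not the complete matchings, and the function name is a hypothetical helper rather than part of MP-Align.

```python
ARGINASE_REACTION = "R00551"  # KEGG identifier quoted in the text


def aligned_on_both_sides(reaction, matching):
    """True if the reaction is matched with itself in the given matching."""
    return any(left == reaction and right == reaction for left, right in matching)


# Excerpt of the acs-hsa matching (Table 5).
acs_hsa_excerpt = [("R00551", "R00551"), ("R01398", "R01398"), ("R00670", "R00670")]

# Excerpt of the hsa-mja matching (Table 4).
hsa_mja_excerpt = [("R01398", "R01398"), ("R00670", "R02282"), ("R01954", "R01954")]

in_acs_hsa = aligned_on_both_sides(ARGINASE_REACTION, acs_hsa_excerpt)
in_hsa_mja = aligned_on_both_sides(ARGINASE_REACTION, hsa_mja_excerpt)
print(in_acs_hsa, in_hsa_mja)
```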

To complete the validation of MP-Align, we attempted 
to compare it with the SubMAP alignment tool [18] (see Endnote). 
This comparison was limited by the fact that the SubMAP 
utility required to translate KEGG pathways into the 


Table 4 Reaction matching hsa00330-mja00330 

hsa Reactions   <->   mja Reactions 
R01253          <->   R03187 
R01251          <->   R02649 
R00670          <->   R02282 
R00667          <->   R02282rev 
R01954          <->   R01954 
R00708          <->   R03443 
R02894          <->   R00253 
R02869          <->   R02869 
R00245rev       <->   R02283 
R01157          <->   R01157 
R01086          <->   R01086 
R00565rev       <->   R00259 
R01398          <->   R01398 
R00667rev       <->   R00669 
R00669          <->   R09107 
R00178          <->   R00178 
R01920          <->   R01920 
R00135          <->   R03187rev 
R05051          <->   R05052 
R00566          <->   R00566 

Reactions appearing in the Urea Cycle are highlighted in boldface. 
Table 5 Reaction matching acs00330-hsa00330 

acs Reactions   <->   hsa Reactions 
R01252          <->   R01252 
R00557rev       <->   R01954 
R01253          <->   R01253 
R01154          <->   R01154 
R00565          <->   R00565 
R00248rev       <->   R00245 
R04025          <->   R04025 
R00239          <->   R00239 
R01992          <->   R01992 
R03313          <->   R03313 
R05050          <->   R05050 
R01251          <->   R01251 
R05052          <->   R05052 
R00670          <->   R00670 
R00667          <->   R00667 
R00551          <->   R00551 
R02894          <->   R02894 
R04221          <->   R04221 
R00248          <->   R00248 
R02869          <->   R02869 
R00256          <->   R00256 
R00149          <->   R00149 
R01991rev       <->   R01991rev 
R02869          <->   R02869 
R03293          <->   R03293 
R02549          <->   R02549 
R00558          <->   R00558 
R05051          <->   R05051 
R00565          <->   R00565rev 
R00253          <->   R00253 
R01991          <->   R01991 
R01398          <->   R01398 
R00667          <->   R00667 
R00669          <->   R00669 
R01992          <->   R01992 
R00178          <->   R00178 
R01920          <->   R01920 
R00135          <->   R00135 
R00557          <->   R00557 
R01881          <->   R01881 
R01251rev       <->   R01251rev 
R03295          <->   R03295 
R00566          <->   R00566 
R01989          <->   R01989 
R00111          <->   R00111 
R01883          <->   R01883 

Reactions appearing in the Urea Cycle are highlighted in boldface. 



Table 6 Final alignment hsa00330-acs00330 

hsa-acs alignment   <->   Enzyme 
R00557rev           <->   ec:1.14.13.39 
R01398              <->   ec:2.1.3.3 
R00557              <->   ec:1.14.13.19 
R00670              <->   ec:4.1.1.17 
R00670              <->   ec:4.1.1.17 
R00248              <->   ec:1.4.1.3 
R00256              <->   ec:3.5.1.38 
R03313              <->   ec:1.2.1.41 
R00565rev           <->   ec:2.1.4.1 
R00551              <->   ec:3.5.3.1 
R00566              <->   ec:4.1.1.19 
R00558              <->   ec:1.14.13.39 
R00565              <->   ec:2.1.4.1 
R04025              <->   ec:1.4.3.4 
R00178              <->   ec:4.1.1.50 
R02869              <->   ec:2.5.1.16 
R02869              <->   ec:2.5.1.22 
R00239              <->   ec:2.7.2.17 
R00248rev           <->   ec:1.4.1.3 
R01883              <->   ec:2.1.1.2 
R00111              <->   ec:1.14.13.39 
R00667rev           <->   ec:2.6.1.13 
R00669              <->   ec:3.5.1.14 
R01920              <->   ec:2.5.1.16 
R01154              <->   ec:2.3.1.57 
R00149              <->   ec:6.3.4.16 
R01881              <->   ec:2.7.3.2 
R00253              <->   ec:6.3.1.2 
R05050              <->   ec:1.2.1.3 

Reactions appearing in the Urea Cycle are highlighted in boldface. 



SubMAP input formalism is no longer available. Our 
analysis had to rely on previously translated pathways, namely 
the Arginine and proline metabolism pathway for H. 
Sapiens, S. Cerevisiae and C. Elegans. 

Focusing once again on the Urea Cycle of the selected 
organisms, we observed that H. Sapiens and S. Cerevisiae 
share the same Urea Cycle, while urea is not synthesized in 
C. Elegans. We performed the alignment between H. 
Sapiens and S. Cerevisiae and between H. Sapiens and C. Elegans using 
both MP-Align and SubMAP. As shown in Table 7, both 
tools were able to correctly match the reactions involved 
in the Urea Cycle (highlighted in boldface) in H. Sapiens 
and S. Cerevisiae. However, when considering the 
complete reaction matching done by both tools, it is clear that 
they perform quite differently: the MP-Align reaction 
matching covers considerably more reactions. 
Table 7 hsa00330-sce00330: reaction matching obtained by SubMAP and MP-Align 

SubMAP - reaction matching          MP-Align - reaction matching 
hsa00330     <->   sce00330         hsa00330     <->   sce00330 
R00243       <->   R00243           R01992       <->   R00774#rev 
R00245       <->   R00245           R00239       <->   R00239 
R00248       <->   R00248           R00670       <->   R00248 
R00551       <->   R00551           R01086       <->   R01086 
R00667       <->   R00667           R01251       <->   R01251 
R00707       <->   R00707           R05052       <->   R05052 
R00708       <->   R00708           R00565#rev   <->   R02283 
R01086       <->   R01086           R00667       <->   R00667 
R01248       <->   R01248           R01954       <->   R01954 
R01251       <->   R01251           R00551       <->   R00551 
R01253       <->   R01253           R02894       <->   R02922 
R01398       <->   R01398           R00248       <->   R00243 
R01954       <->   R01954           R00708       <->   R00708 
R03293       <->   R03291           R02869       <->   R02869 
R03295       <->   R03293           R00253       <->   R00253 
R03646       <->   R03646           R00256       <->   R04445 
R03661       <->   R03661           R02869       <->   R02869 
R04444       <->   R04444           R00248#rev   <->   R00248#rev 
R04445       <->   R04445           R01157       <->   R00670 
R05051       <->   R05051           R02549       <->   R00774 
R05052       <->   R05052           R00245#rev   <->   R00259 
                                    R05051       <->   R05051 
                                    R05051       <->   R05051 
                                    R00667#rev   <->   R00667#rev 
                                    R03313       <->   R03313 
                                    R04221       <->   R03293 
                                    R04025       <->   R03443 
                                    R00669       <->   R00243#rev 
                                    R00178       <->   R00178 
                                    R01398       <->   R01398 
                                    R01920       <->   R01920 
                                    R00135       <->   R02649 
                                    R00245       <->   R00245 
                                    R00557#rev   <->   R02282#rev 
                                    R01881       <->   R00005 
                                    R00557       <->   R00245#rev 
                                    R01251#rev   <->   R01251#rev 
                                    R00566       <->   R02282 
                                    R01991#rev   <->   R05050 
                                    R01989       <->   R03180 
                                    R01253       <->   R01253 

Reactions appearing in the Urea Cycle are highlighted in boldface. 
Concerning the alignment between H. Sapiens and C. 
Elegans, the two tools perform differently. As evident 
in Table 8, SubMAP matches reaction R00565 (catalyzed 
by enzyme 2.1.4.1) with reaction R00554 (catalyzed by 
enzyme 2.7.3.3), although the two reactions belong to 
different parts of the pathways. The wrong match is 
highlighted in boldface. In the matching provided by 
MP-Align, however, reaction R00565 is not matched, so it is 
not included in the final alignment of the two pathways. 

This test revealed a difference between the two tools, 
which became evident when reporting their matchings 
back to the corresponding KEGG maps. The matchings 
computed by SubMAP allow the alignment of individual 
reactions, or small groups of reactions, without 
considering the topology of the whole pathway. MP-Align 
takes the entire network topology into account in the 
final alignment, thereby identifying the largest connected 
subpathway. 
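
To illustrate what "largest connected subpathway" means in practice, the sketch below keeps only the matched reactions of a pathway and returns the largest group of them that is connected. It is a simplified illustration and not the MP-Align algorithm: the pathway, the reaction names r1 to r6 and the adjacency relation (two reactions are assumed adjacent when they are consecutive in the pathway) are all hypothetical.

```python
from collections import deque


def largest_connected_subpathway(matched, adjacency):
    """Largest connected set of matched reactions.

    matched   -- set of reaction identifiers that have a match
    adjacency -- dict mapping each reaction to the set of its neighbours
    """
    seen, best = set(), set()
    for start in matched:
        if start in seen:
            continue
        # Breadth-first search restricted to matched reactions.
        component, queue = {start}, deque([start])
        while queue:
            reaction = queue.popleft()
            for neighbour in adjacency.get(reaction, ()):
                if neighbour in matched and neighbour not in component:
                    component.add(neighbour)
                    queue.append(neighbour)
        seen |= component
        if len(component) > len(best):
            best = component
    return best


# Hypothetical linear pathway r1 - r2 - r3 - r4 - r5 - r6.
adjacency = {
    "r1": {"r2"},
    "r2": {"r1", "r3"},
    "r3": {"r2", "r4"},
    "r4": {"r3", "r5"},
    "r5": {"r4", "r6"},
    "r6": {"r5"},
}

# r4 and r6 have no match, so r5 is isolated from the block r1-r2-r3.
matched = {"r1", "r2", "r3", "r5"}
core = largest_connected_subpathway(matched, adjacency)
print(sorted(core))
```

A tool that ignores topology would report all four matched reactions; restricting the result to a connected block discards the isolated match r5.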

Conclusions 

This paper presents a new methodology and tool for the 
pairwise comparison and alignment of metabolic path- 
ways. The methodology is based on a hypergraph repre- 
sentation of metabolic pathways and defines a reaction 
similarity score based on enzyme and compound similar- 
ities. The proposed alignment technique uses the adopted 
reaction similarity score as well as the pathway topology 
to identify the largest conserved subpathway between the 
two given pathways. 



Table 8 hsa00330-cel00330: reaction matching obtained by SubMAP and MP-Align 

SubMAP - reaction matching          MP-Align - reaction matching 
hsa00330     <->   cel00330         hsa00330     <->   cel00330 
R00243       <->   R00243           R01991#rev   <->   R04445 
R00245       <->   R00245           R00239       <->   R00239 
R00248       <->   R00248           R05052       <->   R05052 
R00565       <->   R00554           R01251       <->   R01251- 
R00667       <->   R00667           R00670       <->   R00670 
R00707       <->   R00707           R02894       <->   R02894 
R00708       <->   R00708           R00248-      <->   R00248 
R01248       <->   R01248           R00708       <->   R00708 
R01251       <->   R01251           R00256       <->   R00256 
R01252       <->   R01252           R02869       <->   R02869 
R01253       <->   R01253           R00248#rev   <->   R00248#rev 
R02894       <->   R02894           R00557       <->   R00554 
R03293       <->   R03291           R01989       <->   R03293 
R03295       <->   R03293           R00245#rev   <->   R00245#rev 
R03646       <->   R03646           R05051       <->   R05051 
R03661       <->   R03661           R05051       <->   R05051 
R04221       <->   R04221           R00253       <->   R00253 
R04444       <->   R04444           R00667#rev   <->   R00667#rev 
R04445       <->   R04445           R03313       <->   R03313 
R05051       <->   R05051           R04221       <->   R04221 
R05052       <->   R05052           R00565#rev   <->   R00669 
                                    R00178       <->   R00178 
                                    R01920       <->   R01920 
                                    R00245       <->   R00245 
                                    R00667       <->   R00667 
                                    R00669       <->   R05050 
                                    R00135       <->   R01251#rev 
                                    R01253       <->   R01253 

A wrong match is highlighted in boldface. 
We used our tool MP-Align to perform several tests 
to validate the proposed similarity score and alignment 
algorithm. The first was a comparative analysis test 
that showed that our approach allows for discriminating 
between different domains. The second was a 
phylogenetic reconstruction test that showed that, by considering 
all the common pathways of eight specific organisms, 
our approach makes it possible to recover a robust 
phylogeny that is very close to the NCBI reference taxonomy 
of those organisms. The last was an alignment test that 
showed that our alignment algorithm correctly identifies 
subpathways sharing a common biological function. 

Finally, we performed a comparison between MP-Align 
and the SubMAP alignment tool [18]. The two tools seem 
to have been designed for different purposes: SubMAP 
looks for small conserved substructures while MP-Align 
identifies the largest conserved subpathway. 

Endnote 

SubMAP allows the matching between reactions to be 
either one-to-one (one reaction is matched to exactly one 
reaction) or one-to-many (one reaction can be matched 
to several reactions, at most five). We performed our 
tests using the one-to-one alternative. 
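
The two matching modes mentioned in this endnote can be told apart with a simple check on a list of matched pairs. The sketch below is illustrative only: the function names, the `max_targets` parameter and the example pairs are hypothetical and are not taken from SubMAP.

```python
from collections import Counter


def is_one_to_one(pairs):
    """True if every reaction appears in at most one pair on its own side."""
    left = Counter(a for a, _ in pairs)
    right = Counter(b for _, b in pairs)
    return all(n == 1 for n in left.values()) and all(n == 1 for n in right.values())


def is_one_to_many(pairs, max_targets=5):
    """True if no left-hand reaction is matched to more than max_targets reactions."""
    targets = {}
    for a, b in pairs:
        targets.setdefault(a, set()).add(b)
    return all(len(t) <= max_targets for t in targets.values())


# Hypothetical matchings used only to exercise the two checks.
strict_pairs = [("r1", "s1"), ("r2", "s2")]
relaxed_pairs = [("r1", "s1"), ("r1", "s2"), ("r2", "s3")]

print(is_one_to_one(strict_pairs), is_one_to_one(relaxed_pairs))
print(is_one_to_many(relaxed_pairs))
```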